{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Pandas Dataframe Exercises\n",
    "## Try me\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ffraile/computer_science_tutorials/blob/main/source/Data%20Manipulation/exercises/Pandas%20Dataframes.ipynb)[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ffraile/computer_science_tutorials/main?labpath=source%2FData%20Manipulation%2Fexercises%2FPandas%20Dataframes.ipynb)\n",
    "\n",
    "In the first exercises, we are going to use the open dataset from the National Institute of Diabetes and Digestive and Kidney Diseases which is available in Kaggle. The dataset contains information about patients with diabetes.\n",
    "You can find it in this URL:\n",
    "\n",
    "https://www.kaggle.com/uciml/pima-indians-diabetes-database\n",
    "\n",
    "We have downloaded the dataset and we have uploaded it to the repository of the course. You can find it in the following URL:\n",
    "\n",
    "https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv'\n",
    "\n",
    "The dataset contains the following columns:\n",
    "\n",
    "* Pregnancies: Number of times pregnant\n",
    "* Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test\n",
    "* BloodPressure: Diastolic blood pressure (mm Hg)\n",
    "* SkinThickness: Triceps skin fold thickness (mm)\n",
    "* Insulin: 2-Hour serum insulin (mu U/ml)\n",
    "* BMI: Body mass index (weight in kg/(height in m)^2)\n",
    "* DiabetesPedigreeFunction: Diabetes pedigree function\n",
    "* Age: Age (years)\n",
    "* Outcome: Class variable (0 or 1)\n",
    "* 268 of 768 are 1, the others are 0\n",
    "* Class Distribution: (class value 1 is interpreted as \"tested positive for diabetes\")\n",
    "\n",
    "The following code loads the dataset into a Pandas dataframe:"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "import pandas as pd\n",
    "diabetes_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv')\n",
    "diabetes_pd"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    },
    "ExecuteTime": {
     "end_time": "2025-12-25T18:44:29.062132Z",
     "start_time": "2025-12-25T18:44:28.488505Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \\\n",
       "0              6      148             72             35        0  33.6   \n",
       "1              1       85             66             29        0  26.6   \n",
       "2              8      183             64              0        0  23.3   \n",
       "3              1       89             66             23       94  28.1   \n",
       "4              0      137             40             35      168  43.1   \n",
       "..           ...      ...            ...            ...      ...   ...   \n",
       "763           10      101             76             48      180  32.9   \n",
       "764            2      122             70             27        0  36.8   \n",
       "765            5      121             72             23      112  26.2   \n",
       "766            1      126             60              0        0  30.1   \n",
       "767            1       93             70             31        0  30.4   \n",
       "\n",
       "     DiabetesPedigreeFunction  Age  Outcome  \n",
       "0                       0.627   50        1  \n",
       "1                       0.351   31        0  \n",
       "2                       0.672   32        1  \n",
       "3                       0.167   21        0  \n",
       "4                       2.288   33        1  \n",
       "..                        ...  ...      ...  \n",
       "763                     0.171   63        0  \n",
       "764                     0.340   27        0  \n",
       "765                     0.245   30        0  \n",
       "766                     0.349   47        1  \n",
       "767                     0.315   23        0  \n",
       "\n",
       "[768 rows x 9 columns]"
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Pregnancies</th>\n",
       "      <th>Glucose</th>\n",
       "      <th>BloodPressure</th>\n",
       "      <th>SkinThickness</th>\n",
       "      <th>Insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>DiabetesPedigreeFunction</th>\n",
       "      <th>Age</th>\n",
       "      <th>Outcome</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>148</td>\n",
       "      <td>72</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>33.6</td>\n",
       "      <td>0.627</td>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>85</td>\n",
       "      <td>66</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>26.6</td>\n",
       "      <td>0.351</td>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>183</td>\n",
       "      <td>64</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>23.3</td>\n",
       "      <td>0.672</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>89</td>\n",
       "      <td>66</td>\n",
       "      <td>23</td>\n",
       "      <td>94</td>\n",
       "      <td>28.1</td>\n",
       "      <td>0.167</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>137</td>\n",
       "      <td>40</td>\n",
       "      <td>35</td>\n",
       "      <td>168</td>\n",
       "      <td>43.1</td>\n",
       "      <td>2.288</td>\n",
       "      <td>33</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>763</th>\n",
       "      <td>10</td>\n",
       "      <td>101</td>\n",
       "      <td>76</td>\n",
       "      <td>48</td>\n",
       "      <td>180</td>\n",
       "      <td>32.9</td>\n",
       "      <td>0.171</td>\n",
       "      <td>63</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>764</th>\n",
       "      <td>2</td>\n",
       "      <td>122</td>\n",
       "      <td>70</td>\n",
       "      <td>27</td>\n",
       "      <td>0</td>\n",
       "      <td>36.8</td>\n",
       "      <td>0.340</td>\n",
       "      <td>27</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>765</th>\n",
       "      <td>5</td>\n",
       "      <td>121</td>\n",
       "      <td>72</td>\n",
       "      <td>23</td>\n",
       "      <td>112</td>\n",
       "      <td>26.2</td>\n",
       "      <td>0.245</td>\n",
       "      <td>30</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>766</th>\n",
       "      <td>1</td>\n",
       "      <td>126</td>\n",
       "      <td>60</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>30.1</td>\n",
       "      <td>0.349</td>\n",
       "      <td>47</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>767</th>\n",
       "      <td>1</td>\n",
       "      <td>93</td>\n",
       "      <td>70</td>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "      <td>30.4</td>\n",
       "      <td>0.315</td>\n",
       "      <td>23</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>768 rows × 9 columns</p>\n",
       "</div>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 1
  },
  {
   "cell_type": "markdown",
   "source": [
    "1. Display the statistical summary of the dataset. What can you say about the dataset?"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "2. Use the query function to find the average BMI of the patients with diabetes and the average BMI of the patients without diabetes. What can you say about the results?\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "For the next exercises, we are going to use the datasets of Datahub, which is a platform for sharing and discovering open data. First, we are going to use the COVID dataset, which contains information about the COVID-19 pandemic. You can find it in the following URL:\n",
    "\n",
    "https://datahub.io/core/covid-1"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "outputs": [
    {
     "data": {
      "text/plain": "           Date   China        US  United_Kingdom    Italy    France  Germany  \\\n0    2020-01-22     548         1               0        0         0        0   \n1    2020-01-23     643         1               0        0         0        0   \n2    2020-01-24     920         2               0        0         2        0   \n3    2020-01-25    1406         2               0        0         3        0   \n4    2020-01-26    2075         5               0        0         3        0   \n..          ...     ...       ...             ...      ...       ...      ...   \n728  2022-01-19  118370  68684431        15610069  9219391  15288014  8361262   \n729  2022-01-20  118470  69329860        15718193  9418256  15715329  8502132   \n730  2022-01-21  118544  70209840        15814617  9603856  16116748  8635461   \n731  2022-01-22  118616  70495874        15891905  9781191  16506090  8716804   \n732  2022-01-23  118773  70699416        15966838  9923678  16807733  8773030   \n\n       Spain     Iran  \n0          0        0  \n1          0        0  \n2          0        0  \n3          0        0  \n4          0        0  \n..       ...      ...  \n728  8676916  6231909  \n729  8834363  6236567  \n730  8975458  6241843  \n731  8975458  6245346  \n732  8975458  6250490  \n\n[733 rows x 9 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Date</th>\n      <th>China</th>\n      <th>US</th>\n      <th>United_Kingdom</th>\n      <th>Italy</th>\n      <th>France</th>\n      <th>Germany</th>\n      <th>Spain</th>\n      <th>Iran</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2020-01-22</td>\n      <td>548</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2020-01-23</td>\n      <td>643</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2020-01-24</td>\n      <td>920</td>\n      <td>2</td>\n      <td>0</td>\n      <td>0</td>\n      <td>2</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>2020-01-25</td>\n      <td>1406</td>\n      <td>2</td>\n      <td>0</td>\n      <td>0</td>\n      <td>3</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>2020-01-26</td>\n      <td>2075</td>\n      <td>5</td>\n      <td>0</td>\n      <td>0</td>\n      <td>3</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>728</th>\n      <td>2022-01-19</td>\n      <td>118370</td>\n      <td>68684431</td>\n      <td>15610069</td>\n      <td>9219391</td>\n      <td>15288014</td>\n      <td>8361262</td>\n      <td>8676916</td>\n      <td>6231909</td>\n    </tr>\n    <tr>\n      <th>729</th>\n      <td>2022-01-20</td>\n      <td>118470</td>\n      <td>69329860</td>\n      <td>15718193</td>\n      <td>9418256</td>\n      <td>15715329</td>\n      <td>8502132</td>\n      <td>8834363</td>\n      <td>6236567</td>\n    </tr>\n    <tr>\n      <th>730</th>\n      <td>2022-01-21</td>\n      <td>118544</td>\n      <td>70209840</td>\n      <td>15814617</td>\n      <td>9603856</td>\n      <td>16116748</td>\n      <td>8635461</td>\n      <td>8975458</td>\n      <td>6241843</td>\n    </tr>\n    <tr>\n      <th>731</th>\n      <td>2022-01-22</td>\n      <td>118616</td>\n      <td>70495874</td>\n      <td>15891905</td>\n      <td>9781191</td>\n      <td>16506090</td>\n      <td>8716804</td>\n      <td>8975458</td>\n      <td>6245346</td>\n    </tr>\n    <tr>\n      <th>732</th>\n      <td>2022-01-23</td>\n      <td>118773</td>\n      <td>70699416</td>\n      <td>15966838</td>\n      <td>9923678</td>\n      <td>16807733</td>\n      <td>8773030</td>\n      <td>8975458</td>\n      <td>6250490</td>\n    </tr>\n  </tbody>\n</table>\n<p>733 rows × 9 columns</p>\n</div>"
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Uncomment to use the version in datahub\n",
    "# covid_pd = pd.read_csv('https://datahub.io/core/covid-19/r/countries-aggregated.csv')\n",
    "covid_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/covid.csv')\n",
    "\n",
    "covid_pd"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "3. Use the data functions to create another column with the month of the date.\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "4. Now, group the data by month and country and create a dataset containing the total number of confirmed cases and deaths for each month and country. What can you say about the results?"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "5. For this exercisee, we are going to use another public repository of data, called [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). In this repository, you can find a lot of datasets for different purposes. We are going to use the dataset of the [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) dataset.\n",
    "Download the dataset as a CSV file, then import it into a Pandas dataframe. What are the columns of the dataset? Display the statistical summary."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n",
     "is_executing": true
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}